Growing Decision Trees Less Greedily
Abstract
Most algorithms which induce model structure from sample data proceed, to varying degrees, "greedily": they sequentially add to the current model the candidate component which works best with the existing structure. (Such components include a linear term with stepwise regression, a small polynomial with GMDH-like methods[2], or a threshold split with decision trees.) This greedy search procedure is relatively fast, but it is not optimal, as there can exist models within the "reachable" space which have less complexity and/or greater accuracy on the training data. Indeed, the difference in training performance between optimal and greedy models can be surprisingly large. Still, it is not clear how much greediness hurts in practice, and whether greedy models typically underperform on unseen but similar data. Here, we review example effects of greediness in regression to motivate study of the issue with another popular model form: decision trees. A new tree algorithm, "Texas Two-Step", is introduced which looks ahead one more generation than standard procedures. In other words, it judges a potential split not by how the resulting child nodes turn out, but by how the grandchildren do. Preliminary results are compared on a recent field application: identifying a bat's species by its chirps.

1. Automated Induction

Inductive algorithms are, at one level, "black boxes" for developing classification, estimation, or control models from sample data. They automatically search a vast space of potential models for the best inputs, structure (terms and interconnections), and parameter values. The models are pieced together in a stepwise manner into a feed-forward network (e.g., a tree) of simple nodes. The better methods also prune unnecessary terms or nodes from the model, thereby regulating complexity to reduce the chance of overfit.
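The one-generation lookahead described in the abstract can be illustrated with a minimal sketch (this is an illustration of the idea, not the paper's actual "Texas Two-Step" implementation): a greedy chooser scores a candidate split by the impurity of its children, while a lookahead chooser scores it by the best impurity achievable among its grandchildren. The Gini impurity criterion, the toy XOR-style data, and all function names below are assumptions made for the sake of the example.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(X, y, feat, thresh):
    """Weighted child impurity of a candidate split (the greedy criterion)."""
    left = X[:, feat] <= thresh
    n = len(y)
    return (left.sum() / n) * gini(y[left]) + ((~left).sum() / n) * gini(y[~left])

def candidate_splits(X):
    """Midpoints between consecutive distinct values, for each feature."""
    for feat in range(X.shape[1]):
        vals = np.unique(X[:, feat])
        for lo, hi in zip(vals[:-1], vals[1:]):
            yield feat, (lo + hi) / 2.0

def best_greedy(X, y):
    """Pick the split minimizing the immediate child impurity."""
    return min(candidate_splits(X), key=lambda s: split_score(X, y, *s))

def lookahead_score(X, y, feat, thresh):
    """Score a split by the best achievable *grandchild* impurity."""
    left = X[:, feat] <= thresh
    n, total = len(y), 0.0
    for mask in (left, ~left):
        Xc, yc = X[mask], y[mask]
        if len(yc) == 0:
            continue
        # best further split of this child, if any; else the child's own impurity
        scores = [split_score(Xc, yc, f, t) for f, t in candidate_splits(Xc)]
        total += (len(yc) / n) * (min(scores) if scores else gini(yc))
    return total

def best_lookahead(X, y):
    """Pick the split whose grandchildren turn out best (one generation ahead)."""
    return min(candidate_splits(X), key=lambda s: lookahead_score(X, y, *s))

# Toy data: the label is XOR of the first two features; the third feature is
# only weakly correlated with the label -- a classic greedy trap.
X = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
              [1, 0, 1], [1, 0, 1], [1, 1, 0], [1, 1, 0]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])

best_greedy(X, y)     # falls for the weakly correlated third feature
best_lookahead(X, y)  # picks a feature of the XOR pair, which a child split completes
```

On such data every split of an XOR feature looks worthless to the greedy criterion (the children are as impure as the parent), yet it is exactly the split whose grandchildren can become pure.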
Overfit models are over-specialized to the training data and generalize poorly (fail on new data). This is widely held to be the chief danger of using inductive methods. Complexity is regulated either through 1) term penalties, as with model selection criteria such as Cp (Mallows, 1973) and Minimum Description Length, MDL (Rissanen, 1978); 2) roughness penalties (integrated second derivatives of the estimation surface); or 3) tests on withheld data (e.g., V-fold cross-validation). The penalties add to an error measure, and models having the lowest combined score are judged the best candidates for use.

Stepwise regression can be considered a low-level automated induction algorithm. Though the set of possible models (linear combinations of a subset of the original candidate inputs) is quite constrained, the procedure does identify which variables to employ and can increase or reduce the size of the set under consideration.

In contrast, Artificial Neural Networks (ANNs) are not inductive methods by the definition used here, as their structure is fixed a priori.[3] They can more precisely be viewed as a class of nonlinear models whose parameters are typically set through a local gradient search called back-propagation.[4] (One suspects that ANNs, which can perform well even when they appear over-parameterized, may avoid overfit partly because of the weakness of this search algorithm! It is possible that improving the search procedure, without simplifying the model structure, would result in better training but worse out-of-sample performance.)[5]

Leading automated induction methods, using "building blocks" consisting of logistic functions, splines, polynomials, planes, non-parametric smoothes of weighted sums, etc., are briefly described in (Elder, 1993) along with their chief strengths and weaknesses. Here, we focus on one of the most popular model forms: decision trees.

Notes:
[1] This work was partially supported by an NSF Research Associateship in Computational Science and Engineering.
[2] Group Method of Data-Handling (Ivakhnenko, 1968). See also the book edited by Farlow, 1984.
[3] Removing small terms within ANN nodes does not address over-parameterization, where useless terms can appear significant though their coefficients collectively cancel. (The dangers of collinear variables in regression are analogous.)
[4] This iterative search converges relatively slowly to a local minimum in parameter space, and it has recently been shown (Mulier and Cherkassky, 1993) that the presentation order of the data affects the particular minimum found.
[5] If this danger is real, then the "greedy" nature of the gradient search may have benefits as well.
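To make the term-penalty idea concrete, here is a hedged sketch of Mallows' Cp on toy regression data (the data, helper names, and parameter choices are invented for illustration and are not from the paper): the submodel's residual error is scaled by the full model's variance estimate and charged two units per parameter, so an extra term must buy a real error reduction to pay for itself.

```python
import numpy as np

def mallows_cp(sse_p, s2_full, n, p):
    """Mallows' Cp for a submodel with p parameters (including intercept).
    sse_p: residual sum of squares of the submodel
    s2_full: error variance estimate from the full model
    n: number of observations"""
    return sse_p / s2_full - n + 2 * p

def ols_sse(X, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

# Toy data: y depends on x1 only; x2 is pure noise.
rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
X_full  = np.column_stack([ones, x1, x2])  # intercept + both candidate terms
X_small = np.column_stack([ones, x1])      # intercept + the genuinely useful term

sse_full = ols_sse(X_full, y)
s2_full = sse_full / (n - X_full.shape[1])  # variance estimate from the full model

cp_small = mallows_cp(ols_sse(X_small, y), s2_full, n, X_small.shape[1])
cp_full  = mallows_cp(sse_full, s2_full, n, X_full.shape[1])
# By construction the full model scores Cp = p exactly; a good submodel
# scores close to its own parameter count.
```

A submodel whose Cp is near (or below) its parameter count is competitive with the full model; the lowest combined score, as the text says, marks the best candidate for use.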
Publication date: 2004